-
Notifications
You must be signed in to change notification settings - Fork 120
Whitelisting Onelake API & Workspace PL FQDNs #552
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @SmritiAgrawal04
This PR needs some tests to show it working I think
crepererum
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do somewhat agree with @alamb. I know it's close to impossible to write an integration test for this, but maybe we can at least have a unit test for parse_url?
src/azure/builder.rs
Outdated
| let first_label = host.split('.').next().unwrap_or_default(); | ||
| self.account_name = Some(validate(first_label)?); | ||
|
|
||
| let container = parsed.path_segments().unwrap().next().expect( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: onelake uses workspace terminology
src/azure/builder.rs
Outdated
| // Regex to match WS-PL FQDN: "{workspaceid}.z??.dfs.fabric.microsoft.com" | ||
| // workspaceid = 32 hex chars, z?? = z + first two chars of workspaceid | ||
| lazy_static::lazy_static! { | ||
| static ref WS_PL_REGEX: Regex = Regex::new(r"^(?P<workspaceid>[0-9a-f]{32})\.z(?P<xy>[0-9a-f]{2})\.(dfs|blob)\.fabric\.microsoft\.com$").unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add support for .onelake.fabric.microsoft.com also
src/azure/builder.rs
Outdated
| let xy = captures.name("xy").unwrap().as_str(); | ||
|
|
||
| // Validate z?? matches first 2 chars of workspaceid | ||
| if &workspaceid[0..2] != xy { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this validation
src/azure/builder.rs
Outdated
| } | ||
|
|
||
| // Otherwise, check Fabric global / Onelake API FQDN | ||
| if host.ends_with(DFS_FABRIC_SUFFIX) || host.ends_with(BLOB_FABRIC_SUFFIX) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we are checking for global endpoint, should these not be with not like (!host.ends_with(DFS_FABRIC_SUFFIX) && !host.ends_with(BLOB_FABRIC_SUFFIX))
src/azure/builder.rs
Outdated
| if host.ends_with(DFS_FABRIC_SUFFIX) || host.ends_with(BLOB_FABRIC_SUFFIX) { | ||
| let labels: Vec<&str> = host.split('.').collect(); | ||
| let account_name = if labels.len() >= 2 && labels[0].contains("api") && labels[1] == "onelake" { | ||
| format!("{}-{}", labels[0], labels[1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are referring workspace id as account_name, then this will not work for non pl scenario. In those case it will give account_name as "westus-api-onelake"
|
Hi @alamb & @crepererum, I added the unit tests as suggested. I request to review the PR please. |
src/azure/builder.rs
Outdated
| let mut builder = MicrosoftAzureBuilder::new(); | ||
| builder | ||
| .parse_url("https://account.blob.fabric.microsoft.com/container") | ||
| .parse_url("https://account.blob.fabric.microsoft.com/") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did this test case change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By mistake. Reverted.
| } | ||
| Some((a, "dfs.fabric.microsoft.com")) | Some((a, "blob.fabric.microsoft.com")) => { | ||
| self.account_name = Some(validate(a)?); | ||
| // Attempt to infer the container name from the URL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why remove this comment? It seems helpful
| } | ||
| "https" => { | ||
| // Regex to match WS-PL FQDN: | ||
| // "{workspaceid}.z??.(onelake|dfs|blob).fabric.microsoft.com" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please also add an example URL for each of the APIs you are adding support for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added for WS-PL DFS/ Blob endpoints. We are waiting for PM to confirm on ABFSS & WS-PL onelake domains.
| self.container_name = Some(validate(parsed.username())?); | ||
| self.account_name = Some(validate(a)?); | ||
| self.use_fabric_endpoint = true.into(); | ||
| } else if let Some(a) = host.strip_suffix("-api.onelake.fabric.microsoft.com") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Is *-api.onelake.fabric.microsoft.com a publicly documented endpoint? If yes, can you point to the Microsoft doc so we can cite it in code/tests?
I don't see it in https://learn.microsoft.com/en-us/fabric/onelake/onelake-access-api
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @SmritiAgrawal04 -- i left some more questions
BTW I asked codex for a review of this PR like this
please review this PR (checked out in this directory): https://github.com/apache/arrow-rs-object-store/pull/552
Specifically ensure that the format of the URLs supported matches publically published microsoft urls and provide documentation references
And this was the output which I found helpful
───────────────────────────────────────────────────────────────────────────────────────────────
• Findings
- High: -api.onelake.fabric.microsoft.com is accepted for both abfs(s) and https, but I can’t find this FQDN in
the public OneLake URL docs. The published endpoints are onelake.dfs.fabric.microsoft.com and workspace-scoped
https://{workspaceId}.z{xy}.dfs.fabric.microsoft.com/... (see https://learn.microsoft.com/en-us/fabric/onelake/
onelake-access-api). If -api.onelake is internal or not publicly documented, this should be gated or removed. - Medium: The WS‑PL regex no longer verifies that z{xy} matches the first two hex chars of the workspace id (src/
azure/builder.rs:692-708). The published WS‑PL format is https://
{workspaceId}.z{first2}.dfs.fabric.microsoft.com (same doc). Without the check, non‑published FQDNs are
accepted. - Medium: WS‑PL regex allows .onelake.fabric.microsoft.com (src/azure/builder.rs:692-699). The public docs list
dfs (and blob endpoints for blob access), but I don’t see onelake as a WS‑PL subdomain. Please confirm with a
Microsoft reference or remove to match published formats. - Low: WS‑PL regex only allows lowercase hex ([0-9a-f]{32}) (src/azure/builder.rs:697). If Microsoft treats
workspace IDs as case‑insensitive GUIDs in URLs, this will reject valid uppercase forms; consider documenting
that only lowercase is supported or make the regex case‑insensitive.
Doc references
- OneLake access API URL formats: https://learn.microsoft.com/en-us/fabric/onelake/onelake-access-api
- Private Link / workspace-specific OneLake endpoints (if this PR targets WS‑PL): please confirm the exact doc
section that defines the workspaceId.z{xy}.dfs.fabric.microsoft.com pattern and whether any onelake WS‑PL host
is documented (I could not find it in public docs).
Questions / assumptions
- Is *-api.onelake.fabric.microsoft.com a publicly documented endpoint? If yes, can you point to the Microsoft
doc so we can cite it in code/tests? - Should WS‑PL accept only dfs/blob subdomains, or is .onelake.fabric.microsoft.com explicitly published?
|
FYI #604 may be related. |
Hi @alamb, I have addressed all comments. About, finding #1 & #3, we plan to add it to the public documentation, the PR for which is already out. I request to approve these changes meanwhile. Thanks |
Which issue does this PR close?
Closes #.
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?